Descriptive statistics

Fake (1) and real (0) news summary statistics



What about Clinton?

## # A tibble: 2 x 2
##    fake   tot
##   <dbl> <int>
## 1     0    42
## 2     1     5

Clinton was mentioned 42 times in real news and only 5 times in fake news.



Main sources of fake and real news?

Top real news sources

## # A tibble: 10 x 2
##    source                              n
##    <chr>                           <int>
##  1 http://politi.co                   43
##  2 http://cnn.it                      22
##  3 http://abcn.ws                      9
##  4 http://occupydemocrats.com          9
##  5 http://eaglerising.com              5
##  6 http://www.addictinginfo.org        5
##  7 http://author.addictinginfo.org     4
##  8 http://rightwingnews.com            4
##  9 http://conservativebyte.com         3
## 10 http://freedomdaily.com             3

Top fake news sources

## # A tibble: 10 x 2
##    source                               n
##    <chr>                            <int>
##  1 <NA>                                29
##  2 http://uspoln.com                    6
##  3 http://thelastlineofdefense.org      4
##  4 http://freedomcrossroads.us          3
##  5 http://departedmedia.com             2
##  6 http://newsfeedhunter.com            2
##  7 http://politicono.com                2
##  8 http://politicot.com                 2
##  9 http://thenewyorkevening.com         2
## 10 http://undergroundnewsreport.com     2

Correlation tests

How similar are the words used in fake vs. real news? The matrix below shows the word correlation between the two news types.

Pearson correlation of about 0.87

##           0         1
## 0 1.0000000 0.8658092
## 1 0.8658092 1.0000000
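
The report does not show how this correlation was computed. As a minimal sketch, assuming it compares the share of each word in real news with its share in fake news, the calculation could look like the following. The toy texts, the function names, and the choice to give a word a share of zero when it is missing from one news type are all assumptions made for illustration.

```python
import math
from collections import Counter


def word_proportions(texts):
    """Share of each word among all words in a list of texts."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data, not taken from the report.
real = ["the senate passed the bill", "the president signed the bill"]
fake = ["the president secretly banned the bill", "shocking news about the senate"]

p_real, p_fake = word_proportions(real), word_proportions(fake)

# Assumption: words absent from one news type get a proportion of 0 there.
vocab = sorted(set(p_real) | set(p_fake))
r = pearson([p_real.get(w, 0.0) for w in vocab],
            [p_fake.get(w, 0.0) for w in vocab])
print(round(r, 3))
```

Restricting the comparison to words that occur in both news types is an equally plausible choice and would generally give a different value.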

Visualizing word correlations

Pairwise correlation



Bigram correlation (tf-idf)

As measured by tf-idf, these are the most important bigrams in each news type, that is, the two-word phrases that most distinguish fake news from real news articles.
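
To make the tf-idf idea concrete, here is a minimal sketch that treats each news type as one document and scores its bigrams. It assumes a common definition, term frequency times the natural log of (number of documents / number of documents containing the term). The report does not state which variant it used, and the toy sentences are invented.

```python
import math
from collections import Counter


def bigrams(text):
    """Split text into lowercase two-word phrases."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def tf_idf(docs):
    """docs maps a document name to its list of terms.

    Returns {name: {term: tf-idf score}} with
    tf = count / total terms in the document and
    idf = ln(number of documents / documents containing the term).
    """
    counts = {name: Counter(terms) for name, terms in docs.items()}
    n_docs = len(docs)
    doc_freq = Counter(term for c in counts.values() for term in c)
    scores = {}
    for name, c in counts.items():
        total = sum(c.values())
        scores[name] = {
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in c.items()
        }
    return scores


# Toy corpus: one "document" per news type (invented text).
docs = {
    "real": bigrams("the white house said the white house will respond"),
    "fake": bigrams("the white house hid the shocking truth"),
}
scores = tf_idf(docs)

for name, terms in scores.items():
    top = sorted(terms.items(), key=lambda kv: kv[1], reverse=True)[:3]
    print(name, top)
```

A bigram that appears in both news types, such as "white house" here, gets an idf of zero and so a tf-idf of zero, which is why the top-ranked bigrams are the ones specific to one news type.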


Unigram correlation (tf-idf)

This is the unigram version of the plot above. These particular words are the most important to each respective news type.


Trigram correlation (tf-idf)

Sentiment analysis

Average AFINN sentiment score for real (0) and fake (1) news sources

## # A tibble: 2 x 2
##    fake sum_sent
##   <dbl>    <dbl>
## 1     0     -646
## 2     1     -899

## # A tibble: 2 x 2
##    fake sum_avg
##   <dbl>   <dbl>
## 1     0   -10.4
## 2     1   -21.0

## # A tibble: 2 x 2
##    fake avg_sent
##   <dbl>    <dbl>
## 1     0   -0.223
## 2     1   -0.465

First table: sum of all sentiment values by news type.

Second table: sum of AFINN values per article, averaged by news type.

There is no strong consensus on how to compare aggregate sentiment scores. In many applications people use averages because documents can vary in length. “If you have a very long document you might see more positive or negative words, but this can simply be a function of having more words overall. If your document lengths are similar, then summing might make more sense” (Prof. Mitts).

This is a good example, though, of how the interpretation of sentiment analysis results varies with methodology.
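
The difference between these aggregations is easy to see on a small example. The sketch below uses a hypothetical mini-lexicon in the AFINN style (integer scores per word, not the actual AFINN values) and invented articles. The first two quantities mirror the first and second tables described above. The third, a mean score per scored word, is shown only as one possible length-normalized measure.

```python
from statistics import mean

# Hypothetical AFINN-style scores, for illustration only.
toy_lexicon = {"good": 3, "win": 4, "bad": -3, "fraud": -4, "crisis": -3}

# Invented articles; fake = 1 marks fake news, fake = 0 real news.
articles = [
    {"fake": 0, "text": "bad crisis but a good win for the senate"},
    {"fake": 0, "text": "good news"},
    {"fake": 1, "text": "fraud fraud bad crisis everywhere"},
]


def word_scores(text):
    """Scores of the words in a text that appear in the lexicon."""
    return [toy_lexicon[w] for w in text.lower().split() if w in toy_lexicon]


def summarize(articles, label):
    per_article = [word_scores(a["text"]) for a in articles if a["fake"] == label]
    all_scores = [s for article in per_article for s in article]
    return {
        # Sum of every scored word in the group.
        "total": sum(all_scores),
        # Sum per article, then averaged across articles in the group.
        "mean_article_sum": mean(sum(article) for article in per_article),
        # Mean score per scored word (one way to normalize for length).
        "mean_word_score": mean(all_scores),
    }


for label in (0, 1):
    print(label, summarize(articles, label))
```

Which summary is appropriate depends on whether the groups differ in number of articles and in article length, so it is worth reporting more than one.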


For the three plots below:

First plot: each line represents the average AFINN score by article.

Second plot: each line represents the sum of AFINN scores by article.

Third plot: each line represents the net BING score by article (positive word count minus negative word count).
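
As a small illustration of the net score, the sketch below counts positive and negative words in one article and subtracts. The two word sets are hypothetical stand-ins for a positive/negative lexicon such as BING, and the example sentence is invented.

```python
# Hypothetical word lists, not the actual BING lexicon.
toy_positive = {"win", "support", "success"}
toy_negative = {"fraud", "scandal", "crisis"}


def net_sentiment(text):
    """Positive word count minus negative word count for one article."""
    words = text.lower().split()
    positive = sum(w in toy_positive for w in words)
    negative = sum(w in toy_negative for w in words)
    return positive - negative


article = "scandal and fraud overshadow a rare win"
print(net_sentiment(article))  # 1 positive word, 2 negative words
```

Unlike AFINN, which weights words by intensity, this kind of score treats every matched word equally.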


‘Not’ words

Words that followed ‘not’ and so contributed to sentiment in the wrong direction.
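
One way to find such words, sketched here under the assumption that negation is detected by looking only at the immediately preceding word, is to scan adjacent word pairs and keep the lexicon words that follow "not". The lexicon scores and the example sentences are hypothetical.

```python
from collections import Counter

# Hypothetical AFINN-style scores, for illustration only.
toy_lexicon = {"good": 3, "happy": 3, "bad": -3}


def negated_contributions(texts, negation="not"):
    """Total lexicon score of each word that directly follows the negation.

    A large positive total means the word was counted as positive even
    though the text negated it, and vice versa.
    """
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        for first, second in zip(words, words[1:]):
            if first == negation and second in toy_lexicon:
                counts[second] += 1
    return {word: n * toy_lexicon[word] for word, n in counts.items()}


texts = [
    "this is not good",
    "voters are not happy and not good",
    "not bad at all",
]
contributions = negated_contributions(texts)
print(contributions)
```

The same scan could be repeated for other negation words such as "no" or "never".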


On average, what proportion of words in an article hold particular sentiment?

Disgust

## # A tibble: 2 x 3
## # Groups:   fake [2]
##    fake mean_word_sent_disgust mean_sent_prop_disgust
##   <dbl>                  <dbl>                  <dbl>
## 1     1                   5.80                 0.028 
## 2     0                   7.01                 0.0256
  • mean_word_sent – average number of ‘disgust’ words per article by news type
  • mean_sent_prop – average total proportion of words in an article that hold ‘disgust’ sentiment
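
The two measures can be sketched as follows. The word set is a hypothetical stand-in for the ‘disgust’ entries of a sentiment lexicon and the articles are invented. The sketch also assumes that every article in a group is included in the averages, even those with no matching words, which the report does not confirm.

```python
from statistics import mean

# Hypothetical 'disgust' word list, not the actual NRC lexicon.
toy_disgust_words = {"corrupt", "rotten", "disgusting"}

# Invented articles; fake = 1 marks fake news, fake = 0 real news.
articles = [
    {"fake": 1, "text": "corrupt officials and rotten deals"},
    {"fake": 1, "text": "a disgusting lie"},
    {"fake": 0, "text": "the committee reviewed the corrupt contract today"},
]


def sentiment_stats(articles, label, lexicon):
    """Average count and average proportion of lexicon words per article."""
    counts, proportions = [], []
    for article in articles:
        if article["fake"] != label:
            continue
        words = article["text"].lower().split()
        hits = sum(w in lexicon for w in words)
        counts.append(hits)
        proportions.append(hits / len(words))
    return {
        "mean_word_sent": mean(counts),
        "mean_sent_prop": mean(proportions),
    }


for label in (1, 0):
    print(label, sentiment_stats(articles, label, toy_disgust_words))
```

The count tends to grow with article length while the proportion does not, which is why the two columns can rank the news types differently.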


Unigram correlation (tf-idf)

These words are the most important ‘disgust’ unigrams to each respective news type.


Fear

## # A tibble: 2 x 3
## # Groups:   fake [2]
##    fake mean_word_sent_fear avg_prop_fear
##   <dbl>               <dbl>         <dbl>
## 1     1                10.4        0.0502
## 2     0                17.0        0.0549

  • mean_word_sent – average number of ‘fear’ words in an article
  • avg_prop_fear – average proportion of ‘fear’ words in an article


Unigram correlation (tf-idf)

These words are the most important ‘fear’ unigrams to each respective news type.


Joy

## # A tibble: 2 x 3
## # Groups:   fake [2]
##    fake mean_word_sent_joy mean_sent_prop_joy
##   <dbl>              <dbl>              <dbl>
## 1     1               6.70             0.0218
## 2     0               8.08             0.029
  • mean_word_sent – average number of ‘joy’ words per article by news type
  • mean_sent_prop – average total proportion of words in an article that hold ‘joy’ sentiment


Unigram correlation (tf-idf)

These words are the most important ‘joy’ unigrams to each respective news type.


Negative

## # A tibble: 2 x 3
## # Groups:   fake [2]
##    fake mean_word_sent_negative mean_sent_prop_negative
##   <dbl>                   <dbl>                   <dbl>
## 1     1                    15.9                  0.0771
## 2     0                    24.5                  0.08
  • mean_word_sent – average number of ‘negative’ words per article by news type
  • mean_sent_prop – average total proportion of words in an article that hold ‘negative’ sentiment


Unigram correlation (tf-idf)

These words are the most important ‘negative’ unigrams to each respective news type.


Anger

## # A tibble: 2 x 3
## # Groups:   fake [2]
##    fake mean_word_sent_anger mean_sent_prop_anger
##   <dbl>                <dbl>                <dbl>
## 1     1                 7.67               0.0406
## 2     0                14.0                0.0561
  • mean_word_sent – average number of ‘anger’ words per article by news type
  • mean_sent_prop – average total proportion of words in an article that hold ‘anger’ sentiment


Unigram correlation (tf-idf)

These words are the most important ‘anger’ unigrams to each respective news type.


Sadness

## # A tibble: 2 x 3
## # Groups:   fake [2]
##    fake mean_word_sent_sadness mean_sent_prop_sadness
##   <dbl>                  <dbl>                  <dbl>
## 1     1                   7.74                 0.0425
## 2     0                  12.4                  0.0432
  • mean_word_sent – average number of ‘sadness’ words per article by news type
  • mean_sent_prop – average total proportion of words in an article that hold ‘sadness’ sentiment


Unigram correlation (tf-idf)

These words are the most important ‘sadness’ unigrams to each respective news type.


Summarizing table of NRC sentiments

Table includes:

  • average number of X sentiment words per article by news type
  • average total proportion of words in an article that hold X sentiment

  • NRC sentiments here: fear, sadness, disgust, anger, negative, joy